
The analysis of metagenomic data involves assigning each individual sequence to a
bacterial taxon or taxa. Bacteria are classified into hierarchical taxonomic ranks in
descending order (kingdom, phylum, class, order, family, genus, species, and subspecies).
The analysis also includes constructing a phylogenetic tree for the bacterial community,
quantifying the presence of bacteria in the sample (abundance), and measuring microbial
diversity within an individual sample and across samples. A number of bioinformatics tools
are available for analyzing metagenomic data.
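
To make abundance and within-sample diversity concrete, the following is a minimal sketch rather than the method of any particular tool. It assumes that each read has already been assigned to a genus, and it uses the Shannon index as one common choice of diversity measure; the function names and the example assignments are hypothetical.

```python
import math
from collections import Counter


def relative_abundance(assignments):
    """Return {taxon: fraction of reads} for the reads of one sample."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def shannon_index(abundances):
    """Shannon diversity H = -sum(p * ln p) over taxa with p > 0."""
    return -sum(p * math.log(p) for p in abundances.values() if p > 0)


# Hypothetical genus-level assignments for the reads of one sample.
sample_reads = ["Bacteroides", "Bacteroides", "Prevotella", "Escherichia"]
abundance = relative_abundance(sample_reads)
diversity = shannon_index(abundance)
```

Here half of the reads belong to one genus, so its relative abundance is 0.5. A sample dominated by a single taxon gives an index near zero, while an even community gives a higher value.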

In amplicon-based metagenomics, the targeted region is a marker gene, part of a gene,
or any genomic region suitable for microbial identification. The ideal target region
combines highly conserved sequences, which serve as primer sites, with less conserved
sequences that discriminate between taxa; it must be present in all target species, and
reference sequences for it must be available in the sequence databases. The 16S rRNA gene
is one of the best candidate marker genes. It codes for a component of the 30S small
subunit of the bacterial ribosome. The gene is around 1500 bp long and consists of nine
conserved regions surrounding hypervariable regions. The conserved regions are labeled
C1, C2, …, C9 and the variable regions are labeled V1, V2, …, V9. PCR primers are designed
from the conserved regions adjacent to a variable region so that the species-specific
targeted region is amplified and enriched by PCR. The amplicons are then sequenced and
analyzed.
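
The primer design described above can be illustrated with a small in-silico amplification. This is only a sketch under simplifying assumptions: the primers must match exactly, no degenerate bases are handled, and the template and primer sequences are made up rather than real 16S rRNA sequences. The function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def extract_amplicon(template, forward_primer, reverse_primer):
    """Return the region spanned by both primers, or None if either is missing.

    The forward primer is searched on the given strand. The reverse primer
    anneals to the opposite strand, so its reverse complement is searched
    downstream of the forward primer. Only exact matches are found.
    """
    start = template.find(forward_primer)
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse_primer)
    end = template.find(reverse_site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


# Toy template: conserved | variable | conserved (all sequences are made up).
conserved_left = "ACGTTGCA"
variable = "GGATCCTTAGC"
conserved_right = "TTGACCGA"
template = "CC" + conserved_left + variable + conserved_right + "AA"

forward_primer = conserved_left
reverse_primer = reverse_complement(conserved_right)
amplicon = extract_amplicon(template, forward_primer, reverse_primer)
```

The returned amplicon contains the variable region together with the two conserved primer sites. The same primer pair would amplify the corresponding region of any organism that shares the conserved sequences, and that is why the product can distinguish taxa.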

Several samples are usually sequenced in a single run using a multiplexing approach, in
which a unique barcode sequence is ligated to the DNA of each sample during library
preparation. After sequencing, the reads are demultiplexed, that is, the reads of each
sample are separated into their own FASTQ files before analysis. Because amplicon-based
metagenomics depends on a targeted region of the genome, it has lower resolution than
shotgun whole-genome sequencing but is less expensive.
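
Demultiplexing can be sketched as follows. The example assumes that each barcode sits at the start of the read and must match exactly. Real protocols may instead store barcodes in separate index reads and tolerate mismatches, so this is an illustration rather than a replacement for a dedicated tool. The barcode sequences, sample names, and reads are hypothetical.

```python
import io
from collections import defaultdict


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        sequence = handle.readline().rstrip()
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip()
        yield header, sequence, quality


def demultiplex(handle, barcodes):
    """Group reads by sample, trimming the matched barcode from each read.

    `barcodes` maps a barcode sequence to a sample name. Reads that start
    with none of the barcodes are collected under "undetermined".
    """
    by_sample = defaultdict(list)
    for header, sequence, quality in read_fastq(handle):
        for barcode, sample in barcodes.items():
            if sequence.startswith(barcode):
                n = len(barcode)
                by_sample[sample].append((header, sequence[n:], quality[n:]))
                break
        else:
            by_sample["undetermined"].append((header, sequence, quality))
    return by_sample


# Hypothetical multiplexed reads and barcode-to-sample mapping.
fastq_text = (
    "@read1\nACGTGGATCC\n+\nIIIIIIIIII\n"
    "@read2\nTGCATTAGCA\n+\nIIIIIIIIII\n"
    "@read3\nGGGGTTAGCA\n+\nIIIIIIIIII\n"
)
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = demultiplex(io.StringIO(fastq_text), barcodes)
```

Each entry of `reads` could then be written to its own FASTQ file. The quality string is trimmed along with the sequence so that the two stay the same length.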

7.2  ANALYSIS WORKFLOW

In the following, we discuss the steps of the amplicon-based metagenomics data analysis
workflow: raw data preprocessing, read clustering, denoising (error removal), taxonomic
group assignment, construction of a phylogenetic tree, and diversity analysis.

7.2.1  Raw Data Preprocessing

After the targeted marker is sequenced, the raw sequence data is obtained in FASTA or
FASTQ format. For a FASTA file produced by Sanger sequencing, the per-base quality scores
may be provided in a separate file, whereas the FASTQ format stores the base quality scores
in the same file as the sequences. The quality scores report the base call quality of each
base; they enable us to assess the sequence reads and to determine whether the reads
require preprocessing before the analysis. The quality control step should not be taken
lightly. Errors in base calls may be introduced during library preparation or sequencing.
Low-quality reads are usually filtered or truncated so that the errors do not affect
the final results. Preprocessing of the raw sequence data may also include demultiplexing
if multiple samples are sequenced in a single run. The demultiplexing step depends on